Navigational ranking for focused crawling

ABSTRACT

Systems and methods of navigational ranking for focused crawling are disclosed. In an exemplary embodiment, a method may include using a classifier to distinguish at least one target web page from other web pages on a website. The method may also include modeling the web pages on the website by a directed graph G=(V, E), wherein each web page is represented by a vertex (V), and a link between two web pages is represented by an edge (E). The method may also include assigning each web page (u) in V a weight p(u) based on the classifier to calculate a navigational ranking indicating relevance of a web page.

BACKGROUND

Although there are a large number of websites on the Internet or World Wide Web (www), users often are only interested in information on specific web pages from some websites. For example, students, professionals, and educators may want to easily find educational materials, like online courses from a particular university. The marketing department of an enterprise may want to know the evaluations of customers, the comparison between their products and those from their competitors, and other relevant product information. Accordingly, various search engines are available for specific websites.

One approach to discovering domain-specific information is to crawl all of the web pages for a website and use a classification tool to identify the desired or “target” web pages. Such an approach is only feasible with a large amount of computing resources, or if the website only has a few web pages. A more efficient way to discover domain-specific information is known as focused crawling. One challenge of implementing efficient focused crawling is to determine the likelihood that a page may quickly lead to target pages.

Two well-known examples of link-based ranking algorithms are the HITS algorithm and variations of the PageRank algorithm, such as Personalized PageRank (PPR) and Dynamic Personalized PageRank (DPPR). These algorithms rank pages according to topic relevance or personal interests. Presumably, these algorithms may be used in focused crawling, i.e., by setting the crawling priority of a page according to the score computed by HITS or DPPR. However, these algorithms each have deficiencies.

In the PageRank algorithm, a web page receives a higher rank if the web pages it is linked from have higher ranks. PPR is similar but in addition takes into account the page relevance. The rank computed by PPR indicates the relevance of a web page to a certain topic but it is not a good measure for the “connectedness” of a web page to target pages. For example, a terminal page (a web page with no out-going links) may have a very high rank, but it does not lead to any other pages. In addition, PageRank and its variations calculate an aggregated score. This is inappropriate for focused crawling. For example, consider two web pages A and B where web page A links to three target pages and three non-target pages, and web page B only links to three target pages. If the rank is calculated according to the PageRank model, web page A will receive a higher rank than web page B. However, from the perspective of crawling, web page B should be ranked higher as it is “purer” than A and leads to target pages.

In addition, PPR and DPPR are one-directional (from ancestors to offspring) score propagation algorithms. Hence, it is hard to identify the hub pages. However, hub pages are often very useful in focused crawling because hub pages are most likely to lead to target pages.

The HITS algorithm, on the other hand, is a two-directional (between ancestors and offspring) score propagation algorithm, and it can be used to identify both hubs and authorities on certain topics. Intuitively, hubs are the web pages that should be identified and explored in focused crawling. However, the HITS algorithm has a similar problem to PageRank in that it calculates aggregated scores. In addition, in the HITS algorithm, the target pages are used as the “seed” to form a sub-structure surrounding them, and the scores are only computed for those nodes in the sub-structure. In focused crawling, a score should be computed for every page, often far away from target pages. Accordingly, the HITS algorithm does not work well in such a case.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high-level diagram of an exemplary networked computer system in which navigational ranking for focused crawling may be implemented.

FIG. 2 is an organizational layout for an exemplary website.

FIG. 3 is a block diagram illustrating exemplary navigational ranking for focused crawling on a website using a static model.

FIG. 4 is a block diagram illustrating exemplary navigational ranking for focused crawling on a website using an active model.

DETAILED DESCRIPTION

Systems and methods of navigational ranking for focused crawling are disclosed. Exemplary embodiments of navigational ranking proactively look for target pages in a website by following links through web pages that are more likely to lead to target pages. The likelihood of a web page leading to a target page is measured based on the link structure of the website. The focused crawler implementing navigational ranking can discover most target pages by only exploring a small portion of the available web pages, therefore reducing the time and resources (and hence cost) needed to crawl the website.

FIG. 1 is a high-level illustration of an exemplary networked computer system 100 (e.g., via the Internet) in which navigational ranking for focused crawling may be implemented. The networked computer system 100 may include one or more communication networks 110, such as a local area network (LAN) and/or wide area network (WAN), for connecting one or more websites 120 at one or more hosts 130 (e.g., servers 130 a-c) to one or more users 140 (e.g., client computers 140 a-c).

The term “client” as used herein (e.g., client computers 140 a-c) refers to one or more computing devices through which one or more users 140 may access the network 110. Clients may include any of a wide variety of computing systems, such as a stand-alone personal desktop or laptop computer (PC), workstation, personal digital assistant (PDA), or appliance, to name only a few examples. Each of the client computing devices may include memory, storage, and a degree of data processing capability at least sufficient to manage a connection to the network 110, either directly or indirectly. Client computing devices may connect to network 110 via a communication connection, such as a dial-up, cable, or DSL connection via an Internet service provider (ISP).

The focused crawling operations described herein may be implemented by the host 130 (e.g., servers 130 a-c which also host the website 120) or by a third party crawler 150 (e.g., servers 150 a-c) in the networked computer system 100. In either case, the servers may execute program code which enables focused crawling of one or more websites 120 in the networked computer system 100. The results may then be stored (e.g., by crawler 150 or elsewhere in the network) and accessed on demand to assist the user 140 when searching the website 120.

The term “server” as used herein (e.g., servers 130 a-c or servers 150 a-c) refers to one or more computing systems with computer-readable storage. The server may be provided on the network 110 via a communication connection, such as a dial-up, cable, or DSL connection via an Internet service provider (ISP). The server may be accessed directly via the network 110, or via a network site. In an exemplary embodiment, the website 120 may also include a web portal on a third-party venue (e.g., a commercial Internet site) which facilitates a connection for one or more servers via a back-end link or other direct link. The servers may also provide services to other computing or data processing systems or devices. For example, the servers may also provide transaction processing services for users 140.

When the server is “hosting” the website 120, it is referred to herein as the host 130 regardless of whether the server is from the cluster of servers 130 a-c or the cluster of servers 150 a-c. Likewise, when the server is executing program code for focused crawling, it is referred to herein as the crawler 150 regardless of whether the server is from the cluster of servers 130 a-c or the cluster of servers 150 a-c.

The program code may execute the exemplary operations described herein for navigational ranking for focused crawling. In exemplary embodiments, the operations may be embodied as logic instructions on one or more computer-readable media. When executed on a processor, the logic instructions cause a general purpose computing device to be programmed as a special-purpose machine that implements the described operations. In an exemplary implementation, the components and connections depicted in the figures may be used.

In focused crawling, the program code needs to efficiently identify target web pages. This is often difficult to do because target web pages are typically located “far away” from the website's home page. For example, web pages for university courses are on average about eight web pages away from the university's home page, as illustrated in FIG. 2.

FIG. 2 is an organizational layout 200 for an exemplary website, such as the website 120 shown in FIG. 1. In this example, the website is a university website having a home page 210 with a number of links 215 a-e to different child web pages 220 a-c. At least some of the child web pages may also link to child web pages, such as web page 230, and then web pages 240-260, and so forth. The target web pages 270 a-c are linked to through web page 260.

Here it can be seen that the shortest path from the university's home page 210 (the “root”) to the target web page 270 a containing course information (e.g., for CS1) is <Homepage> <Academic Division> <Engineering & Applied Sciences> <Computer Sciences> <Academic> <Course Websites> <CS1>. According to the systems and methods described herein, a focused crawler is able to discover the target page 270 a by following this shortest path from the root and assigning each web page a navigational rank. Navigational rank is described in more detail below, and can be used to determine how likely each page is to lead to target pages.

In an exemplary embodiment, the navigational rank may be determined as follows. Assuming that a classifier is available for distinguishing target pages from other web pages, the web pages on a website can be modeled by a directed graph G=(V, E). That is, each web page is represented by a vertex V, and a link between two web pages is represented by an edge E. Each web page or node u in V is assigned a weight p(u), by a classifier, to indicate the relevance of the page. The higher the weight, the higher the relevance (i.e., that the web page is a target). The weight can be a binary or real number dependent on the classifier.
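
The graph model described above may be held in memory in many ways. The following is a minimal sketch, assuming an adjacency-set representation; the class name SiteGraph, its methods, and the page names are hypothetical and are not taken from the disclosure.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class SiteGraph:
    """Directed graph G = (V, E) of a website with classifier weights p(u)."""
    out_links: dict[str, set[str]] = field(default_factory=dict)  # u -> pages u links to
    weight: dict[str, float] = field(default_factory=dict)        # u -> p(u)

    def add_page(self, u: str, p: float = 0.0) -> None:
        """Register vertex u with the weight p(u) assigned by the classifier."""
        self.out_links.setdefault(u, set())
        self.weight[u] = p

    def add_link(self, u: str, w: str) -> None:
        """Register the directed edge u -> w, creating unknown vertices with p = 0."""
        for page in (u, w):
            self.out_links.setdefault(page, set())
            self.weight.setdefault(page, 0.0)
        self.out_links[u].add(w)

    def in_degree(self, w: str) -> int:
        """Number of links pointing to w."""
        return sum(1 for targets in self.out_links.values() if w in targets)


# Hypothetical three-page site: home -> courses -> cs1, where cs1 is a target.
g = SiteGraph()
g.add_page("cs1", p=1.0)
g.add_link("home", "courses")
g.add_link("courses", "cs1")
```

A binary weight is used here for simplicity; a real-valued classifier score could be stored in the same field.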

Given such a graph with vertex weights, for each vertex u in V, its Navigational Rank NR(u) is calculated by the following iterative process:

    1. for all u, $NR^{(0)}(u) \leftarrow 1$
    2. $t \leftarrow 0$
    3. for all u, $NR^{(t+1)}(u) = d \cdot p(u) + (1-d) \cdot \operatorname{avg}_{w}\left[ NR^{(t)}(w) / N_i(w) \right]$
    4. normalize $NR^{(t+1)}(u)$ such that they average to 1
    5. if for all u, $|NR^{(t+1)}(u) - NR^{(t)}(u)| < e$, stop, and let $NR(u) = NR^{(t+1)}(u)$
    6. otherwise, $t \leftarrow t+1$, return to step 3

In these calculations, w represents all of the vertices pointed to by u; d is a damping factor (typically a small constant, such as 0.2); Ni(w) is the number of links pointing to w; and e is an error bound, which was selected as 10⁻⁵ for purposes of illustration here.

In the above, steps 1 and 2 initialize the process. In step 3 the navigational rank is computed as a linear combination of the initial relevance rating p and a value derived from the navigational rank of the neighbors computed in the last iteration. In step 4 the values are normalized, and in step 5 it is determined if the convergence condition has been met.

The intuition is that each node is rewarded by pointing to nodes with a high score and penalized by pointing to nodes with a low score, where the score is recursively defined. For any weight p, the above iterative process typically converges, and the convergence is usually rapid.
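
The iterative process above may be sketched in code as follows. This is an illustrative sketch only, under stated assumptions: the function name navigational_rank, the dictionary-based graph, the max_iter safeguard, and the six-page example site are all hypothetical. The average over an empty set of out-neighbors is taken to be 0, which appears consistent with the values shown for terminal pages in the description, but is an assumption.

```python
def navigational_rank(out_links, p, d=0.2, eps=1e-5, max_iter=1000):
    """Iterate steps 1 to 6; every link target must also be a key of out_links."""
    nodes = list(out_links)
    # Ni(w): number of links pointing to w
    n_in = {u: 0 for u in nodes}
    for u in nodes:
        for w in out_links[u]:
            n_in[w] += 1

    nr = {u: 1.0 for u in nodes}                      # step 1
    for _ in range(max_iter):                         # steps 2 and 6 (max_iter is a safeguard)
        new = {}
        for u in nodes:                               # step 3
            succ = out_links[u]
            # assumption: a page with no out-links contributes an average of 0
            avg = sum(nr[w] / n_in[w] for w in succ) / len(succ) if succ else 0.0
            new[u] = d * p[u] + (1 - d) * avg
        mean = sum(new.values()) / len(new)           # step 4
        if mean > 0:
            new = {u: x / mean for u, x in new.items()}
        converged = all(abs(new[u] - nr[u]) < eps for u in nodes)  # step 5
        nr = new
        if converged:
            break
    return nr


# Hypothetical site: "hub" links only to targets, "mixed" to one target and one noise page.
links = {
    "root": ["hub", "mixed"],
    "hub": ["t1", "t2"],
    "mixed": ["t1", "noise"],
    "t1": [], "t2": [], "noise": [],
}
weights = {"root": 0, "hub": 0, "mixed": 0, "t1": 1, "t2": 1, "noise": 0}
ranks = navigational_rank(links, weights)
```

In this hypothetical site the page that links only to targets ends with a higher rank than the page that links to one target and one noise page, and the terminal noise page has rank 0.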

The above process can be better understood with reference to the following illustration. FIG. 3 is a website graph (G) 300 which may be used to illustrate exemplary navigational ranking calculations using a static model, where all of the web pages from the website are downloaded to generate graph 300. In graph 300, node A is the root and nodes D and F are target web pages.

Table 1 shows the value of p, the intermediate value of NR after the first two iterations in the process described above, and the final NR calculation (d=0.2). In Table 1, there are two rows for each value of t, where the upper row shows the value of NR after step 3, and the lower row shows the normalized value of NR after step 4.

TABLE 1
Exemplary NR Calculation

                        A     B     C     D     E     F
p                       0     0     0     1     0     1
t = 1  after step 3     0.67  0.40  0.53  0.20  0.00  0.20
       after step 4     2.00  1.20  1.60  0.60  0.00  0.60
t = 2  after step 3     0.83  0.24  0.16  0.20  0.00  0.20
       after step 4     3.05  0.89  0.59  0.74  0.00  0.74
t = ∞  after step 3     0.64  0.31  0.21  0.20  0.00  0.20
       after step 4     2.46  1.20  0.80  0.77  0.00  0.77

As can be seen from the exemplary calculations in Table 1, while nodes D and F are the only two relevant pages, their NR values are relatively low. Indeed, NR measures how likely a page may lead to target pages, not how likely a web page is to be a target page.
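
The relationship between the two rows for each value of t in Table 1 can be checked with a few lines of code. The sketch below takes the published step 3 values for t = 1 and divides each by their mean, which is the normalization of step 4. Because the published values are rounded to two decimals, the result only agrees with the table to within rounding; the variable names are illustrative.

```python
# Step 3 values for t = 1, copied from Table 1 (already rounded to two decimals).
step3 = {"A": 0.67, "B": 0.40, "C": 0.53, "D": 0.20, "E": 0.00, "F": 0.20}
# Step 4 values for t = 1, copied from Table 1.
table_step4 = {"A": 2.00, "B": 1.20, "C": 1.60, "D": 0.60, "E": 0.00, "F": 0.60}

# Step 4: scale the values so that they average to 1.
mean = sum(step3.values()) / len(step3)
normalized = {u: x / mean for u, x in step3.items()}

# Largest disagreement with the published row, attributable to input rounding.
max_gap = max(abs(normalized[u] - table_step4[u]) for u in step3)
```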

In this example, all of the pages in the website had to first be downloaded to obtain the graph 300 shown in FIG. 3. This model, referred to as the “static” model, is suitable where domain-specific crawling is implemented on many similar sites. For example, the crawler may implement the static model to crawl course pages from all university websites. Several entire websites may be downloaded and used to calculate navigational ranks by the above procedure. Then, a machine learning process may be invoked to discover the relation between navigational rank and the features of a web page, such as its URL name, anchor texts, or content. Then, for a new website, the learned results are used to approximate the navigational rank of each page encountered in the crawling process. Web pages with higher ranks are expanded during the crawling.

In other circumstances where it is not desired to download all of the pages in a website to obtain a graph G, such as the graph 300 shown in FIG. 3, the navigational rank may be calculated according to an “active” model. In the active model, the structure for each individual site is determined by dynamically adjusting the nodes' navigational rank while crawling the site. The navigational ranks reflect more accurately the structure of the website.

However, there is a problem in applying the definition of NR in the static model directly in the active model. That is, for every URL that has not been downloaded, there are no “out” links since the corresponding web pages have not been downloaded and parsed. The value of NR of those pages will always be 0 according to the static definition of NR, which is not useful for discerning which page is more likely to lead to target pages.

To address this issue, additional calculations may be implemented to propagate the NR scores during crawling. Navigational rank for the active model (designated NR′) is calculated as follows:

    1. for all u, $NR'^{(0)}(u) \leftarrow 1$
    2. $t \leftarrow 0$
    3. for all u, $NR'^{(t+1)}(u) = d \cdot NR(u) + (1-d) \cdot \operatorname{avg}_{v}\left[ NR'^{(t)}(v) / N_o(v) \right]$
    4. normalize $NR'^{(t+1)}(u)$ such that they average to 1
    5. if for all u, $|NR'^{(t+1)}(u) - NR'^{(t)}(u)| < e$, stop, and let $NR'(u) = NR'^{(t+1)}(u)$
    6. otherwise, $t \leftarrow t+1$, return to step 3

In these calculations, NR(u) is the Navigational Rank of vertex u computed by the first algorithm; v is all of the vertices pointing to u; and No(v) is the number of links pointing away from v.

The iteration is very similar to the process described above. The difference is that the direction of score propagation is reversed. Previously, the average is taken from the out-neighbors (the neighbors which u points to); in the above process, the average is taken from the in-neighbors (the neighbors which point to u).
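
The reversed propagation may be sketched as follows. This is an illustrative sketch, not a definitive implementation: the function name propagate_rank, the max_iter safeguard, the example site, and the first-step NR values supplied to it are hypothetical. URLs that have not been downloaded are represented as vertices with no out-links, and the average over an empty set of in-neighbors is assumed to be 0.

```python
def propagate_rank(out_links, nr, d=0.2, eps=1e-5, max_iter=1000):
    """Second-step iteration: scores flow from in-neighbors (ancestors) to u."""
    nodes = list(out_links)
    in_links = {u: [] for u in nodes}
    for v in nodes:
        for u in out_links[v]:
            in_links[u].append(v)
    n_out = {v: len(out_links[v]) for v in nodes}     # No(v)

    score = {u: 1.0 for u in nodes}                   # step 1
    for _ in range(max_iter):                         # steps 2 and 6 (max_iter is a safeguard)
        new = {}
        for u in nodes:                               # step 3
            preds = in_links[u]
            # assumption: a page with no in-links contributes an average of 0
            avg = sum(score[v] / n_out[v] for v in preds) / len(preds) if preds else 0.0
            new[u] = d * nr[u] + (1 - d) * avg
        mean = sum(new.values()) / len(new)           # step 4
        if mean > 0:
            new = {u: x / mean for u, x in new.items()}
        converged = all(abs(new[u] - score[u]) < eps for u in nodes)  # step 5
        score = new
        if converged:
            break
    return score


# Hypothetical partial crawl: x, y, z and w are URLs seen but not yet downloaded.
links = {
    "root": ["hub", "mixed"],
    "hub": ["x", "y"],
    "mixed": ["z", "w"],
    "x": [], "y": [], "z": [], "w": [],
}
# Hypothetical first-step NR values in which "hub" outranks "mixed".
first_step = {"root": 3.5, "hub": 2.8, "mixed": 0.7, "x": 0, "y": 0, "z": 0, "w": 0}
second_step = propagate_rank(links, first_step)
```

In this hypothetical example, the undownloaded URLs reached from the higher-ranked page receive a higher NR′ than those reached from the lower-ranked page, even though all four have a first-step NR of 0.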

The active model may be implemented in focused crawling for calculating navigational rank in real-time (i.e., as subsets of the website are downloaded). FIG. 4 is a website graph (G′) 400 which may be used to illustrate exemplary navigational ranking calculations using an active model, where subsets (e.g., subsets 410 and 420) of the web pages are sequentially downloaded from the website to generate graph 400. Again, node A′ is the root of the subset and nodes D′ and F′ are target web pages. Note that node A′ may be the home page, but does not need to be the home page of the website. Accordingly, navigational ranking may be implemented faster and more efficiently, even when the entire website is not available.

First a standard breadth first search method is used to download the first subset of pages 410. Then a classifier is invoked and the navigational rank of each node in the subset of graph 400 is calculated. The crawler then downloads more web pages (e.g., the second subset 420) by following links on the web pages in the first subset 410 with higher navigational ranks until the number of new pages reaches a threshold value. This threshold value may be any suitable number of web pages based on design considerations (e.g., processing power, desired time to completion, etc.). Each web page's navigational rank is then recalculated on the expanded graph, and the crawling process is repeated (downloading more subsets and calculating NR) until sufficient target pages are located.
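
The crawl cycle described above may be sketched as a loop. The sketch below is one possible reading, under stated assumptions: fetch_links and classify are hypothetical callables standing in for page download and the classifier, the classifier is assumed to be binary, the site is simulated with an in-memory dictionary, and undownloaded URLs are ordered by the two-step score (NR followed by NR′). All names and parameter values are illustrative.

```python
from collections import deque


def _rank(nodes, base, neighbors, degree, d=0.2, eps=1e-5, max_iter=200):
    """Damped, normalized averaging iteration shared by both propagation steps."""
    score = {u: 1.0 for u in nodes}
    for _ in range(max_iter):
        new = {}
        for u in nodes:
            nb = neighbors[u]
            avg = sum(score[x] / degree[x] for x in nb) / len(nb) if nb else 0.0
            new[u] = d * base[u] + (1 - d) * avg
        mean = sum(new.values()) / len(new)
        if mean > 0:
            new = {u: s / mean for u, s in new.items()}
        done = all(abs(new[u] - score[u]) < eps for u in nodes)
        score = new
        if done:
            break
    return score


def frontier_scores(out_links, p):
    """Two-step ranking over downloaded pages plus the URLs they link to."""
    nodes = sorted(set(out_links) | {w for ws in out_links.values() for w in ws})
    out = {u: out_links.get(u, []) for u in nodes}
    inn = {u: [] for u in nodes}
    for v in nodes:
        for w in out[v]:
            inn[w].append(v)
    n_in = {u: len(inn[u]) for u in nodes}
    n_out = {u: len(out[u]) for u in nodes}
    weight = {u: p.get(u, 0.0) for u in nodes}
    nr = _rank(nodes, weight, out, n_in)       # first step: offspring to ancestors
    return _rank(nodes, nr, inn, n_out)        # second step: ancestors to offspring


def active_crawl(root, fetch_links, classify, first_subset=3, threshold=2, wanted=2):
    out_links, p = {}, {}

    def download(url):
        out_links[url] = list(dict.fromkeys(fetch_links(url)))  # drop duplicate links
        p[url] = classify(url)

    # Breadth-first download of the first subset.
    queue, seen = deque([root]), {root}
    while queue and len(out_links) < first_subset:
        url = queue.popleft()
        download(url)
        for w in out_links[url]:
            if w not in seen:
                seen.add(w)
                queue.append(w)

    # Re-rank, then download up to `threshold` new pages per cycle.
    while sum(p.values()) < wanted:
        scores = frontier_scores(out_links, p)
        frontier = sorted((u for u in scores if u not in out_links),
                          key=lambda u: -scores[u])
        if not frontier:
            break
        for url in frontier[:threshold]:
            download(url)
    return out_links, p


# Simulated site (hypothetical); pages whose names start with "course-" are targets.
site = {
    "home": ["about", "academics", "news"],
    "about": ["contact"],
    "academics": ["cs", "math"],
    "news": ["story1", "story2"],
    "cs": ["course-cs1", "course-cs2"],
}
downloaded, relevance = active_crawl(
    "home",
    fetch_links=lambda url: site.get(url, []),
    classify=lambda url: 1.0 if url.startswith("course-") else 0.0,
)
```

The loop stops once the requested number of target pages has been classified or no undownloaded URLs remain; how many non-target pages are visited along the way depends on the site and on the parameter values chosen.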

In an alternative embodiment, the relation between the navigational rank and the features of web pages may be determined, similar to the static model, and the results used to guide the crawling. After the second-step propagation, NR scores computed in the first step are distributed to the nodes following the links. If these pages are not downloaded yet, they can be assigned higher NR scores and crawled first in the next crawling cycle.

Exemplary embodiments of navigational ranking may also implement an average score, not the summation, in an iterative computation to determine page rank. Using the previous illustration of the prior art, where web page u had two child web pages that are targets and three child web pages that are noise, u was ranked at 5 units; and where web page v had three child web pages and all three are targets, v was ranked at 3 units, both using a summation approach. Accordingly, web page u was erroneously selected as the target web page. According to the teachings herein, however, web page u is ranked ⅖ using an averaging approach (i.e., two targets out of five total child web pages); and web page v would be ranked 3/3 (or 1, i.e., three targets out of three total child web pages), so higher than web page u. Accordingly, the navigational ranking using an averaging approach provides more accurate results during a focused crawl.
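
The arithmetic of this comparison can be written out in a few lines. The sketch below is illustrative only: each child page is represented by 1 if it is a target and 0 if it is noise, the summation-style score is modeled as a simple count of child pages (which reproduces the 5 and 3 units quoted above), and the function names are hypothetical.

```python
from fractions import Fraction

# 1 marks a target child page, 0 marks a noise child page.
u_children = [1, 1, 0, 0, 0]   # two targets, three noise pages
v_children = [1, 1, 1]         # three targets


def counted_score(children):
    """Summation-style score modeled as the number of child pages."""
    return len(children)


def averaged_score(children):
    """Fraction of child pages that are targets."""
    return Fraction(sum(children), len(children))


u_count, v_count = counted_score(u_children), counted_score(v_children)
u_avg, v_avg = averaged_score(u_children), averaged_score(v_children)
```

With the count, u (5) outranks v (3); with the average, v (3/3) outranks u (2/5), which is the ordering a focused crawler would prefer.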

Also in exemplary embodiments, navigational ranking may implement a one-direction score propagation strategy, from offspring to ancestors. A web page is ranked higher if it points to pages with a high score. Therefore, the hub pages can be effectively identified. Alternatively, navigational ranking may implement a two-direction and two-step score propagation strategy. Again, a web page is ranked higher if it points to pages with a high score so that the hub pages can be effectively identified. Next, the score obtained in the first step is distributed from ancestors to offspring. Therefore, a potential target page will likely be crawled because it is pointed to by high scoring pages. Moreover, this two-step score propagation is more effective than the one-step propagation used in HITS.

It is understood that the embodiments shown and described herein are intended only for purposes of illustration of exemplary systems and methods and are not intended to be limiting. In addition, the operations and examples shown and described herein are provided to illustrate exemplary implementations of navigational ranking for focused crawling. It is noted that the operations are not limited to those shown. Other operations may also be implemented. Still other embodiments of navigational ranking for focused crawling are also contemplated, as will be readily appreciated by those having ordinary skill in the art after becoming familiar with the teachings herein.

In addition to the specific embodiments explicitly set forth herein, other aspects and implementations will be apparent to those skilled in the art from consideration of the specification disclosed herein.

The invention claimed is:
 1. A method comprising: accessing, by a computing device, a plurality of web pages; and determining, by the computing device, for each web page of the plurality of web pages, a navigational rank to a root node by, (a) determining a number of links pointing to the plurality of web pages; (b) setting an initial navigational rank of the web page to an initial value; (c) identifying navigational ranks of the web pages to which the web page points; (d) calculating a navigational rank of the web page based on a weight p(u) assigned to the web page (u) and an average of the identified navigational ranks of the web pages to which the web page points divided by the determined number of links pointing to the plurality of web pages; (e) normalizing the calculated navigational ranks of the web pages so that the calculated navigational ranks average to a certain numerical value; (f) determining whether a difference between the normalized navigational rank and the initial navigational rank of the web page is below a predefined error bound; (g) in response to the determined difference being less than the predefined error bound, setting the navigational rank of the web page to be the calculated navigational rank; (h) in response to the determined difference being below the predefined error bound; and repeating (c) to (h) for the web page based on the calculated navigational ranks of the web pages until the differences between the normalized navigational ranks and the initial navigational ranks of the plurality of web pages are each below the predefined error bound.
 2. The method of claim 1, wherein a higher weight p(u) corresponds to higher relevance of the web page to the root node.
 3. The method of claim 1, further comprising assigning each web page of the plurality of web pages a respective weight p(u) that corresponds to a likelihood that the web page is a target web page of the root node.
 4. The method of claim 1 wherein the weight p(u) is a binary number or a real number.
 5. The method of claim 1 wherein calculating the navigational rank further comprises calculating the navigational rank according to a static model, wherein under the static model, the plurality of web pages are downloaded and the navigational ranks of the plurality of web pages are determined from the downloaded plurality of web pages.
 6. The method of claim 5, further comprising calculating the navigational rank according to the static model where domain-specific crawling is implemented on many similar sites.
 7. The method of claim 5, further comprising determining the navigational rank of each of the plurality of web pages according to an active model based on the static model.
 8. The method of claim 7, wherein in the active model, a structure for each individual website is determined by dynamically adjusting a navigational rank of a node while crawling the website.
 9. The method of claim 8, wherein the active model is used in focused crawling for calculating navigational rank in real-time as subsets of the website are downloaded.
 10. The method of claim 7, wherein determining the navigational rank according to the active model is based on subsets of a website sequentially downloaded from the website to generate a graph of the website.
 11. The method of claim 1, further comprising modeling the plurality of web pages by a directed graph, wherein each of the plurality of web pages is represented by a vertex and links between the plurality of web pages are represented by edges.
 12. The method of claim 1, wherein calculating the navigational rank comprises using a one-direction score propagation strategy, from offspring to ancestor web pages.
 13. The method of claim 1, further comprising discovering a target page by following a shortest path from the root node and the navigational ranks of the web pages.
 14. The method of claim 1, further comprising approximating a navigational rank of each page encountered while crawling a new website, wherein a web page from the new website having a highest navigational rank is expanded.
 15. A system of navigational ranking for focused crawling, comprising: a processor; a non-transitory computer-readable data storage medium storing program code executable by the processor to: access a plurality of web pages on a web site; assign each web page of the plurality of web pages a weight indicating relevance of the web page to a root web page; for each web page of the plurality of web pages, determine a navigational rank to the root web page, wherein to determine the navigational rank of each web page to, (a) determine a number of links pointing to the plurality of web pages; (b) set an initial navigational rank of the web page to an initial value; (c) identify navigational ranks of the web pages to which the web page points; (d) calculate a navigational rank of the web page based on a weight assigned to the web page and an average of the identified navigational ranks of the web pages to which the web page points divided by the determined number of links pointing to the plurality of web pages; (e) normalize the calculated navigational ranks of the web pages so that the calculated navigational ranks average to a certain numerical value; (f) determine whether a difference between the normalized navigational rank and the initial navigational rank of the web page is below a predefined error bound; (g) in response to the determined difference being less than the predefined error bound, set the navigational rank of the web page to be the calculated navigational rank; (h) in response to the determined difference being below the predefined error bound; and repeat (c) to (h) for the web page based on the calculated navigational ranks of the web pages until the differences between the normalized navigational ranks and the initial navigational ranks of the plurality of web pages are each below the predefined error bound.
 16. The system of claim 15, wherein a higher weight corresponds to higher relevance of the web page to the root web page.
 17. The system of claim 15, wherein the program code is further to cause the processor to assign each web page of the plurality of web pages a respective weight that corresponds to a relevance that the web page is a target web page of the root web page.
 18. The system of claim 15 wherein the weight is a binary number or a real number.
 19. The system of claim 15, wherein to calculate the navigational rank, the program code is further to cause the processor to calculate the navigational rank according to a static model, wherein under the static model, the plurality of web pages are downloaded and the navigational ranks of the plurality of web pages are determined from the downloaded plurality of web pages.
 20. The system of claim 15, wherein the program code is further to cause the processor to determine the navigational rank of each of the plurality of web pages according to an active model.
 21. The system of claim 20, wherein to determine the navigational rank according to the active model, the program code is further to cause the processor to determine the navigational rankings based on subsets of the website sequentially downloaded from the website to generate a graph of the website.
 22. A non-transitory computer-readable data storage medium on which is stored machine readable instructions that when executed by a processor cause the processor to: access a plurality of web pages on a website; assign each web page of the plurality of web pages a weight indicating relevance of the web page to a root web page; for each web page of the plurality of web pages, determine a navigational rank to the root web page, wherein to determine the navigational rank of each web page, the processor is to, (a) determine a number of links pointing away from the plurality of web pages; (b) set an initial navigational rank of the web page to an initial value; (c) identify navigational ranks of the web pages that point to the web page; (d) calculate a navigational rank of the web page based on the initial value a weight assigned to the web page and an average of the identified navigational ranks of the web pages to which the web page points divided by the determined number of links pointing to the plurality of web pages; (e) normalize the calculated navigational ranks of the web pages so that the calculated navigational ranks average to a certain numerical value; (f) determine whether a difference between the normalized navigational rank and the initial navigational rank of the web page is below a predefined error bound; (g) in response to the determined difference being less than the predefined error bound, set the navigational rank of the web page to be the calculated navigational rank; (h) in response to the determined difference being below the predefined error bound; and repeat (c) to (h) for the web page based on the calculated navigational ranks of the web pages until the differences between the normalized navigational ranks and the initial navigational ranks of the plurality of web pages are each below the predefined error bound.
 23. The non-transitory computer-readable data storage medium of claim 22, wherein the instructions are to cause the processor to: calculate the navigational rank according to a static model, wherein under the static model, the plurality of web pages are downloaded and the navigational ranks of the plurality of web pages are determined from the downloaded plurality of web pages.
 24. The non-transitory computer-readable data storage medium of claim 23, wherein the instructions are to cause the processor to: determine the navigational rank of each of the plurality of web pages according to an active model based on the static model.